Speech reading enhances auditory perception in noise. One means by which this
perceptual facilitation comes about is through information from visual networks reinforcing 
the encoding of the congruent speech signal by ignoring interfering acoustic signals. We 
tested this hypothesis neurophysiologically by acquiring EEG while individuals listened to 
words with a fixed portion of each word replaced by white noise. Congruent (meaningful) 
or incongruent (reversed frames) mouth movements accompanied the words. Individuals 
judged whether they heard the words as continuous (illusion) or interrupted (illusion 
failure) through the noise. We hypothesized that congruent, as opposed to incongruent, 
mouth movements should further enhance illusory perception by suppressing the auditory 
cortex's response to interruption onsets and offsets. Indeed, we found that the N1 
auditory evoked potential (AEP) to noise onsets and offsets was reduced when individuals 
experienced the illusion during congruent, but not incongruent, audiovisual streams. This 
N1 inhibitory effect was most prominent at noise offsets, suggesting that visual influences 
on auditory perception are instigated to a greater extent during noisy periods. These 
findings suggest that visual context due to speech-reading disengages (inhibits) neural 
processes associated with interfering sounds (e.g., noisy interruptions) during speech 
perception. 
INTRODUCTION

The integration of auditory and visual information, such as com- 
bining speech-reading with listening, increases comprehension, 
especially in noisy conditions and in individuals with hearing loss 
(Sumby and Pollack, 1954; Grant and Seitz, 2000; Kaiser et al., 
2003; Ross et al., 2007; Zion Golumbic et al., 2013). Basic research 
has provided insights into the neural functioning of audiovisual 
(AV) integration in speech processing. An emerging theory posits 
that AV integration is partly mediated via temporal alignment of 
the neural response to mouth movements with the response rep- 
resenting the contour of the speech envelope (Chandrasekaran 
and Ghazanfar, 2009; Luo et al., 2010). One test of this the-
ory is to examine how it fares in adverse acoustic environments
(Bishop and Miller, 2009; Shahin et al., 2009; Zion Golumbic
et al., 2013). That is, how does the temporal coherence between 
mouth movements and the speech envelope affect the perception 
of degraded (noise-interrupted) speech? The theory predicts that 
a robust encoding of the speech signal, i.e., the contour of speech 
envelope, should be strengthened by simultaneously disengaging 
(inhibiting) neural processes associated with interfering sounds. 

To this end, Shahin et al. (2012) tested this theory by 
examining the influence of visual context provided during 
speech-reading on illusory filling-in. Illusory filling-in occurs 
when individuals perceive a noise-interrupted sound as contin-
uous through the noisy segment. Previous accounts (Riecke et al.,
2009; Shahin et al., 2009, 2012) concluded that this phenomenon
is partly accomplished by suppressing the auditory cortex (AC) 
response to the onsets/offsets of noisy interruptions, creating the 
illusion that the sound (speech) envelope is continuous. Shahin 
et al. (2012) hypothesized that visual context should further 
enhance this inhibitory process, thereby reinforcing illusory per- 
ception. They reasoned that this should be the case because a cou- 
pling of the heard speech with speech-reading enhances encoding 
of the speech envelope in AC (Zion Golumbic et al., 2013). In 
turn, the AC response to noisy signals that do not conform to 
the speech envelope should be inhibited to reduce AC sensitivity 
to interfering signals. To probe this premise, Shahin et al. (2012) 
examined auditory evoked potentials (AEPs) time-locked to the 
onsets and offsets of noise-interruptions while individuals lis- 
tened to noise-interrupted words and judged whether they heard 
the words as continuous (experienced the illusion) or interrupted 
(failed to experience the illusion). The participants made these 
judgments while they watched mouth movements that were con- 
gruent (meaningful), incongruent (reversed frames), or static (no 
movements) with the spoken words. 

FIGURE 1 | Stimuli. (A) Three frames corresponding to different time
points along the utterance of the word "direction." (B) The temporal
waveforms and corresponding spectrograms for the word "direction." The
left panel depicts the original word with no noise, which was not used in
the current experiment. The right panel depicts a physically interrupted
word, where white noise replaced 100% of the fricative /ʃ/.

Contrary to Shahin et al.'s (2012) prediction, congruent visual
cues did not weaken AEPs to interruption boundaries (onsets and
offsets) more so than incongruent or static mouth movements.
However, individuals tolerated longer interruptions for the con-
gruent vs. the incongruent and static conditions. This latter result
was due to a key experimental design parameter, in which inter-
ruption duration was adaptively allowed to reach the maximum
duration (ceiling) at which the participant could still perceive 
illusory continuity. Accordingly, one reason for the null effect 
in Shahin et al. (2012) is that variation in interruption duration 
might have masked AEP inhibitory evidence, especially in the con- 
gruent condition. That is, by allowing the noise duration to reach 
a participant's threshold (maximum), AEPs in all conditions may 
have reached the same maximum possible amplitude, creating a 
ceiling effect. 

In the present study, we hypothesized that visually-induced
AC inhibition might be observed by examining the AC response
to interruption boundaries when the interruption duration
was equal between the congruent and incongruent conditions, but
always below each participant's threshold. Thus, theoretically the
interruption duration should be further below ceiling in the con-
gruent than in the incongruent condition. Hence, we expected that,
in both the congruent and incongruent conditions, the AC response
to interruption boundaries would be smaller during illusory perception
than when the illusion fails. However, this difference should occur
to a greater extent in the congruent than in the incongruent condition,
as Shahin et al. (2012) originally proposed. Finally, this neuro-
physiological effect should be reflected behaviorally as a greater
number of trials labeled as illusion in the congruent than in the
incongruent condition.

MATERIALS AND METHODS 
SUBJECTS 

Fourteen native English speakers (average age = 27 years, range 
18-60 years old; 8 females; 13 right handed and 1 left handed) 
with no known hearing problems participated in this study. All 
were between the ages of 18 and 30, except for one who was 
60 years of age. The data from this individual and another par- 
ticipant were not included in the analyses because they had 
fewer than 12 valid trials in one condition of the EEG data.
Handedness was determined using the Edinburgh Handedness 
Inventory. The study was conducted at the Auditory Neuroscience 
Lab, The Ohio State University Department of Otolaryngology 
and was approved by the local Institutional Review Board. The 
experiments were undertaken with the understanding and writ- 
ten consent of each subject, and the study conformed to The 
Code of Ethics of the World Medical Association (Declaration of 
Helsinki). 

STIMULI 

The auditory and visual stimuli were the same as those used 
in Shahin et al. (2012) (Figures 1A,B). Briefly, the stimuli con- 
sisted of 230 trisyllabic audiovisual words which were segregated 
into auditory (2550 ms) and visual (85 frames) segments. This 
ensured that the extracted frames and corresponding acoustic 
signal covered the entire mouth-movements and speech sig- 
nal, respectively, with several frames with still (closed neutral 
position) lip-movements and silence at the beginning and end. 
There were three conditions: static, congruent and incongruent. 
In the static condition a still picture of the corresponding face 
accompanied auditory presentation of the word. In the con-
gruent condition, mouth movements were synchronized with
the acoustic word. In the incongruent condition, the frames
of the congruent condition were reversed during word pre-
sentation, which was done to keep the visual motion at the
same overall energy as the congruent condition. This is impor-
tant to rule out physical differences in stimuli causing EEG
effects.

PROCEDURE AND TASK 
Behavior 

The experiment began with a behavioral (calibration) session 
using the static condition only, in which the maximum thresh- 
old of interruption duration resulting in perception of continuity 
was adaptively measured for each subject. Individuals sat in a 
sound attenuated room 1 meter in front of a 24 inch computer 
monitor and wore insert earphones (ER-4B Etymotic Research, 
Elk Grove Village, IL). Sound level was adjusted to the subject's 
comfort level (range 65 ± 5 dB) and kept constant throughout 
the experiment. Individuals listened to all 230 words randomly 
presented while fixating on a still face (static). All words were 
interrupted by white noise that was of the same intensity +3 dB
SPL as the replaced segment. The duration of the replaced part
of the word (i.e., white noise duration), which was always cen-
tered on a fricative or affricate ([tʃ], [dʒ], [d], [s], [ʃ], [z]), was
adjusted adaptively from trial to trial. We should note that the
center of the replaced phoneme was 480 ± 124 ms following voice
onset, placing it, on average, in the center of the word (words
had an average length of ~1 s). Also, the first mouth movement
occurred 320 ± 144 ms prior to voice onset. Because the noise-
replaced phonemes varied in absolute duration, the duration of
the noise itself was adjusted as a proportion of the phoneme's
total duration. In the first trial of the adaptive procedure, 100% of
the fricative/affricate was replaced by white noise. If individu-
als responded that the word continued behind the noise (illusion),
the white noise duration was increased by 15% (7.5% on each
end of the noise) for the next trial. However, if individuals iden-
tified the word as interrupted (illusion-failure), noise duration
was decreased by 15% for the next trial, and so forth. The cal-
ibration session lasted 15 min. Afterwards, the mean duration
of white noise across trials and across the illusion and illusion-
failure percepts was calculated for each individual. Following
the calibration test, a fricative/affricate in each of the 230 words
was replaced by white noise of the duration obtained from the cal-
ibration session for use in the EEG session. It is important
to emphasize that the noise duration was fixed for an indi-
vidual, and did not vary across the congruent and incongruent
conditions.
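
The adaptive rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the calibration code used in the study: the function names are hypothetical, and the sketch assumes that the 15% step is a fraction of the phoneme's duration, that the noise is never shortened below one step, and that the value carried into the EEG session is the plain mean over all calibration trials.

```python
from statistics import mean

STEP = 0.15  # step size from the text, assumed to be a fraction of phoneme duration


def next_proportion(current, heard_continuous, step=STEP):
    """Return the noise duration (as a proportion of the phoneme) for the next trial.

    The noise is lengthened after an illusion response and shortened after an
    illusion-failure response. The lower bound of one step is an assumption.
    """
    updated = current + step if heard_continuous else current - step
    return max(updated, step)


def run_calibration(responses, start=1.0):
    """Apply the rule over a sequence of responses (True = illusion).

    Returns the proportion presented on each trial and the mean of those values.
    """
    presented = []
    current = start  # the first trial replaces 100% of the phoneme
    for heard_continuous in responses:
        presented.append(current)
        current = next_proportion(current, heard_continuous)
    return presented, mean(presented)


if __name__ == "__main__":
    # Hypothetical response sequence, for illustration only.
    track, mean_proportion = run_calibration([True, True, False, True, False])
    print(track, mean_proportion)
```

Expressing the duration as a proportion keeps the rule independent of the absolute length of each replaced phoneme, which is the motivation given in the text.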

EEG 

Continuous EEG data were acquired using a BioSemi ActiveTwo 
system (Amsterdam, Netherlands; 64-channel cap, 10-20 sys- 
tem, Ag-AgCl electrodes, sampled at 512 Hz). The two passive 
electrodes Common Mode Sense (CMS) and the Driven Right 
Leg (DRL) served as ground. There were two blocks of inter- 
rupted word presentations. There were two additional blocks of 
control trials in which intact (non-interrupted) words were pre- 
sented with the congruent and incongruent visual streams. These 
additional blocks were included to test other hypotheses unre- 
lated to the current study. All blocks were randomized between 
subjects to rule out order effects. Each of the main blocks was 
15 min long and contained all 230 noise-inserted words randomly 
presented. In one block, 115 interrupted words were presented 
with congruent visual stimuli and the other 115 words were pre- 
sented with incongruent visual stimuli. In the second block, the 
congruent and incongruent word pairings were reversed. Thus, 
the only difference between these two conditions was the visual
presentation (the acoustic stimuli, including the phonemes that
were replaced and white noise durations, were identical). Block
order was randomized across subjects. Each stimulus presenta-
tion was followed by a silent period of 1 s along with the still
picture frame of the last displayed face. The average trial dura-
tion was set to 4.6 s. However, the inter-stimulus interval (ISI)
of the spoken words was 2.55 ± 0.32 s (mean ± SD), since sound onsets
occurred at variable times in each trial. Figure 2 depicts the 
approximate timing of the unfolding events on a trial. Subjects 
pressed a button with their left index finger when they perceived 
the stimulus as continuous, and pressed another button with their 
left middle finger when they perceived it as interrupted. During 
the experiment, subjects were instructed to focus their attention 
on the talker's mouth and base their decision on the continuity 
(not the meaning) of the spoken word while ignoring the white 
noise. 
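
One way to realize the counterbalanced word-to-condition pairing described above is sketched below. The function name, the seed, and the use of integer word indices are hypothetical; the only details taken from the text are the 230 words, the 115/115 split within a block, and the reversal of the pairing in the second block.

```python
import random


def build_blocks(n_words=230, seed=0):
    """Return two blocks of (word_index, condition) trials.

    Half of the words are paired with congruent video in block 1 and the other
    half with incongruent video; block 2 reverses the pairing. Trial order is
    shuffled independently within each block.
    """
    rng = random.Random(seed)
    words = list(range(n_words))
    rng.shuffle(words)
    half = n_words // 2
    first_half, second_half = words[:half], words[half:]

    block1 = ([(w, "congruent") for w in first_half]
              + [(w, "incongruent") for w in second_half])
    block2 = ([(w, "incongruent") for w in first_half]
              + [(w, "congruent") for w in second_half])
    rng.shuffle(block1)
    rng.shuffle(block2)
    return block1, block2


if __name__ == "__main__":
    block1, block2 = build_blocks()
    print(block1[:3], block2[:3])
```

With this pairing every word is heard once in each congruency condition, so the acoustic material is identical across conditions by construction.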

DATA ANALYSIS 
Behavior 

The number of illusion and illusion-failure responses in the two 
congruency conditions (congruent, incongruent) was calculated 
for each participant. 



EEG 

Using EEGLAB (Delorme and Makeig, 2004) and in-house 
MATLAB code, continuous EEG files for all blocks were 
concatenated into one grand continuous file and filtered using 
a high-pass Butterworth filter (>0.5 Hz). This file was then
epoched into trials (regardless of condition type) from -500
to 4000 ms around trial onset (onset of the first frame of the
video). Trials were then average referenced, baselined to the
500 ms pre-stimulus period and corrected for ocular artifacts
using independent component analysis (ICA). Then the tri-
als were re-epoched from -200 to 1500 ms around the onsets
of interruptions. Trials containing amplitudes of ±150 µV or
greater in any channel were rejected. Data were then sepa-
rated according to percept type (illusion and illusion-failure) and
congruency condition (congruent, incongruent) and re-epoched
around the onsets and offsets of interruptions from -200
to 500 ms. This re-epoching allowed us to examine electrophys-
iological changes between percepts and congruency conditions
as a function of time-locking condition (onset or offset of 
interruption). There was no further baselining. Thus, the same 
baseline was maintained for both the onset and offset time- 
locking conditions. This was done to make sure that possible 
effects differentiating the two conditions were not attributed 
to different baseline periods. The mean (± SD) number of trials in
each condition that was included in the analysis was as follows:
congruent illusion = 148 ± 40, congruent illusion-failure = 61 ± 33,
incongruent illusion = 113 ± 35, and incongruent illusion-failure =
90 ± 30. The overall
number of illusion trials (~130) exceeded that of illusion-failure 
(~75) trials. This discrepancy is not unexpected and shows that 
individuals experienced the illusion more often than its failure. 
Finally, auditory evoked potentials (AEPs) were generated by 
averaging trials for each subject, channel, time-locking condition, 
congruency condition and percept type. The mean potential of 
each individual/condition was subtracted from the AEP averages. 
Data from two subjects were not included in the final EEG anal- 
ysis because they had too few trials (<30 trials) in one of the 
conditions. 
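
The epoching, baselining, and amplitude-rejection steps described above can be illustrated for a single channel with the standard library alone. This is a simplified sketch under stated assumptions, not the EEGLAB/MATLAB pipeline used in the study: the function names are hypothetical, filtering, re-referencing, and ICA are omitted, and the data are assumed to be in microvolts.

```python
from statistics import mean

SFREQ = 512  # sampling rate reported in the text (Hz)


def extract_epoch(signal, event_sample, tmin, tmax, sfreq=SFREQ):
    """Return (times_in_s, samples) for one epoch around event_sample.

    Returns None if the window would run past either end of the recording.
    """
    start = event_sample + round(tmin * sfreq)
    stop = event_sample + round(tmax * sfreq)
    if start < 0 or stop > len(signal):
        return None
    times = [(i - event_sample) / sfreq for i in range(start, stop)]
    return times, signal[start:stop]


def subtract_baseline(times, samples, t_from, t_to):
    """Subtract the mean of the samples with t_from <= t < t_to."""
    base = mean(s for t, s in zip(times, samples) if t_from <= t < t_to)
    return [s - base for s in samples]


def exceeds_threshold(samples, threshold=150.0):
    """True if any sample reaches +/- threshold (same units as the data)."""
    return max(abs(s) for s in samples) >= threshold


if __name__ == "__main__":
    # Synthetic single-channel recording with one large artifact.
    signal = [5.0] * 5000
    signal[3000] = 400.0
    for event in (1000, 3000):
        times, samples = extract_epoch(signal, event, -0.5, 1.0)
        corrected = subtract_baseline(times, samples, -0.5, 0.0)
        print(event, "rejected" if exceeds_threshold(corrected) else "kept")
```

Because the baseline is removed once, on the epoch locked to video onset, any later re-epoching around interruption onsets or offsets would simply slice the corrected samples again without a second baseline step, as the text specifies.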

FIGURE 2 | Trial depiction. Approximate timing of unfolding events during one trial.

FIGURE 3 | Group mean classification (illusion, illusion-failure)
percentages and standard deviations for the congruent and
incongruent conditions; illusion implies that the individuals classified the
interrupted stimuli as continuous, while illusion-failure implies that the
individuals classified the interrupted stimuli as interrupted.

We limited our AEP analyses to the N1 and P2 AEPs since
both of these AEPs exhibited changes with illusory perception in
Shahin et al. (2012). We concentrated on two regions of inter-
est (ROIs) where auditory activity is known to be dominant, the
fronto-central region (channels F1, Fz, F2, FC1, FCz, FC2) and
centro-parietal region (channels C1, Cz, C2, CP1, CPz, CP2). To
obtain peak amplitudes we adopted a technique motivated by
Clayson et al. (2013): 1) Peak latencies of the N1 and P2 AEPs
were obtained from the group mean for each ROI and condition.
2) Individual N1 or P2 mean peak amplitudes (±10 ms) centered
on the group latency values in step (1) were obtained for each
ROI and condition. This method led to the following windows
being used. For the N1 AEP at centro-parietal ROI, the window
of analysis was consistent across conditions due to the consistent
latency of the N1 peak: we used a window of 90-110 ms for the
congruent illusion and illusion-failure percept types at the onset
and offset; we used a window of 90-110 ms for the incongruent
illusion-failure percept at onset and offset, and we used a win-
dow of 86-106 ms for the incongruent illusion percept at onset and
offset. For the P2 AEP at centro-parietal ROI, the window of anal-
yses exhibited high variability between conditions: this resulted
in using a window of 165-185 ms for the congruent and incon- 
gruent illusion-failure percept at the onset; we used a window of 
180-200 ms for the congruent illusion-failure percept at the offset, 
and a window of 195-215 ms for the incongruent illusion-failure 
percept at the offset; we used a window of 205-225 ms for the 
congruent illusion percept at the onset, a window of 140-160 ms 
for the incongruent illusion percept at the onset, a window of 
210-230 ms for the congruent illusion percept at the offset, and a 
window of 175-195 ms for the incongruent illusion percept at the 
offset. Similar values were obtained for the fronto-central ROI. 
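
The two-step amplitude measure described above (group-level peak latency, then a ±10 ms mean for each individual) might be sketched as follows. The function names and the explicit search window for the group peak are assumptions made for illustration; the text does not state how the group peak was located.

```python
from statistics import mean


def peak_latency_ms(group_wave, times_ms, search_from, search_to, polarity):
    """Latency of the group-mean peak within [search_from, search_to] ms.

    polarity = -1 finds the most negative point (e.g., N1);
    polarity = +1 finds the most positive point (e.g., P2).
    """
    candidates = [(polarity * v, t) for v, t in zip(group_wave, times_ms)
                  if search_from <= t <= search_to]
    return max(candidates)[1]


def mean_amplitude(wave, times_ms, center_ms, half_width_ms=10):
    """Mean of one subject's waveform within center_ms +/- half_width_ms."""
    return mean(v for v, t in zip(wave, times_ms)
                if center_ms - half_width_ms <= t <= center_ms + half_width_ms)


if __name__ == "__main__":
    # Synthetic waveforms sampled every 2 ms, for illustration only.
    times = list(range(0, 301, 2))
    group = [-max(0, 20 - abs(t - 100)) for t in times]  # negative peak at 100 ms
    subject = [0.8 * v for v in group]
    latency = peak_latency_ms(group, times, 50, 150, polarity=-1)
    print(latency, mean_amplitude(subject, times, latency))
```

Averaging over a window centered on the group latency, rather than picking each subject's own extreme value, makes the measure less sensitive to noise in individual waveforms.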

STATISTICAL ANALYSES 
Behavior 

For each subject, we first calculated the percentage of illusion and 
illusion-failure responses in each congruency condition. We then 
normalized the percept classifications (number of responses of 
illusion or illusion-failure) to percentages of the overall response 
within a congruency condition. Then we conducted repeated 
measures analysis of variance (ANOVA) contrasting classifica-
tion percentages across conditions with the independent vari-
ables being congruency (congruent, incongruent) × percept type
(illusion, illusion-failure). Post-hoc analyses were based on the
Newman-Keuls test. 
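
For a fully within-subject design in which every factor has two levels, each ANOVA effect has one degree of freedom, and its F statistic equals the squared one-sample t statistic of the corresponding per-subject contrast. The sketch below uses that identity to illustrate the 2 × 2 analysis; it is not the statistical software used in the study, the function names are hypothetical, and the Newman-Keuls post hoc tests are not covered.

```python
from math import sqrt
from statistics import mean, stdev


def contrast_F(scores):
    """F(1, n-1) for a one-degree-of-freedom within-subject contrast.

    scores: one contrast value per subject. F is the squared one-sample t
    statistic testing the mean contrast against zero. If the contrast has no
    variance across subjects (e.g., percentages that sum to 100 within a
    condition), F is undefined and NaN is returned.
    """
    n = len(scores)
    sd = stdev(scores)
    if sd == 0:
        return float("nan"), (1, n - 1)
    t = mean(scores) / (sd / sqrt(n))
    return t * t, (1, n - 1)


def two_by_two_within(cells):
    """cells: per-subject tuples (a1b1, a1b2, a2b1, a2b2).

    Returns F and degrees of freedom for factor A, factor B, and A x B.
    """
    main_a = [(s[0] + s[1]) - (s[2] + s[3]) for s in cells]
    main_b = [(s[0] + s[2]) - (s[1] + s[3]) for s in cells]
    interaction = [(s[0] - s[1]) - (s[2] - s[3]) for s in cells]
    return {"A": contrast_F(main_a),
            "B": contrast_F(main_b),
            "AxB": contrast_F(interaction)}


if __name__ == "__main__":
    # Made-up scores for four subjects, for illustration only.
    example = [(10, 4, 6, 5), (12, 5, 7, 7), (9, 6, 8, 6), (11, 5, 6, 6)]
    print(two_by_two_within(example))
```

Here factor A would correspond to congruency and factor B to percept type, so the A x B contrast asks whether the illusion versus illusion-failure difference changes with congruency.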

EEG 

We conducted separate ANOVAs for the N1 and P2 AEPs (ampli-
tude or latency), with the independent variables being percept
type × time-locking × congruency. Post hoc analyses were per-
formed using the Newman-Keuls test.

RESULTS 
BEHAVIOR 

Inspection of the data from the calibration session (prior to EEG 
recording) showed that the duration of the noise interruption 
exceeded the fricative duration. The group average threshold for 
the perception of continuity (illusion) was 187 ± 64% (mean ± SD) of the
average duration of the replaced phoneme, which translates to a
group average duration of 281 ± 96 ms (mean ± SD). This result shows
that perception of the illusion was not confined to the phoneme 
on which the noise interruption was centered, but also extended 
to adjacent phonemes. 

Turning to the behavioral data collected during the EEG ses- 
sion, Figure 3 shows how often individuals classified the stim- 
uli as interrupted (illusion-failure) or continuous (illusion) for 
the congruent and incongruent conditions. Recall that the phys-
ical attributes of auditory stimuli were identical in the two
congruency conditions. Thus, any difference in classification must
be due to a difference in perception, and cannot be due to physical
differences in the stimuli. An ANOVA with variables congruency
and percept type revealed a main effect of percept type [F(1,11) =
9.6, p < 0.01, ηp² = 0.46] and an interaction between percept
type and congruency [F(1,11) = 21.0, p < 0.005, ηp² = 0.65]. The
main effect of percept type was attributed to a greater number of
trials labeled as continuous (illusion) than interrupted (illusion-
failure). The interaction between the variables was attributed
to a stronger illusion (a greater number of trials labeled as
continuous vs. interrupted) occurring in the congruent than
incongruent condition (congruent: p < 0.001; incongruent: p > 0.1;
Newman-Keuls).
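
The effect sizes reported with each F statistic can be related to F through the standard identity for partial eta squared, ηp² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to the two behavioral effects; the function name is hypothetical, and the degrees of freedom (1, 11) are an assumption that follows from 12 analyzed participants in a fully within-subject design.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # F values quoted in the behavioral results; df = (1, 11) is assumed.
    effects = [("percept type", 9.6), ("percept type x congruency", 21.0)]
    for label, f_value in effects:
        print(label, round(partial_eta_squared(f_value, 1, 11), 3))
```

The values obtained this way are close to the reported 0.46 and 0.65; small differences are expected because the published F values are rounded.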

EEG 

As a reminder, our hypothesis stated that the weakening of N1-P2 
AEPs to onsets and offsets of noise-interruptions during illusion 
vs. illusion-failure perception should be greater in the congru- 
ent than incongruent streams. Figure 4A shows the group average 
AEP waveforms of the illusion and illusion-failure percepts (super- 
imposed) in the congruent and incongruent conditions (averaging 
across onsets and offsets of interruptions). Figure 4B shows the
waveforms when the data were further segregated according to
onset and offset conditions. They represent the average wave-
forms across channels in the centro-parietal ROI shown in the
black-outlined box of the middle N1 topography of the congru-
ent condition. The channels comprising the fronto-central ROI
are shown in the more anterior white-lined box. Below the wave-
forms are N1 topographies (mean topographies of 20 ms around
the peak) for the two percept types. In line with our predictions,
notice that in the waveforms of the congruent condition the N1 is
more prominent for the illusion-failure (black waveforms) than
the illusion (gray waveforms) percept. This relationship is not
realized in the incongruent condition. These observations were
tested statistically.

AEP amplitude: Effect of AV congruency on percept type
FIGURE 4 | (A) Group average AEP waveforms of the illusion (gray) and
illusion-failure (black) percepts (superimposed) for the congruent and
incongruent conditions averaged across onsets and offsets of
interruptions. The waveforms represent the average across the
centro-parietal channels (C1, C2, Cz, CP1, CP2, CPz, see black-outlined
box of the middle N1 topography of the congruent condition). The
topographies at the bottom represent the mean potential distribution over
a 20 ms window around the N1 peak. Dashed lines at 0 ms represent
sound interruption onset and offset. (B) AEP waveforms and
topographies of the N1 separated according to the onset and offset of
interruptions time-locking conditions. (C) Bar graph depicting the mean
N1 amplitudes and standard errors for all conditions.

AEP amplitude

We first conducted t-tests comparing fronto-central and centro-
parietal topographies (collapsing across all conditions). The dif-
ference approached significance for the N1 AEP (p = 0.06) and was
highly significant for the P2 AEP (p = 0.0001). These effects were
due to greater N1 and P2 amplitudes observed at the fronto-central
ROI than at the centro-parietal ROI. These results warranted separate
ANOVAs for the fronto-central and centro-parietal ROIs.

Also, statistical analyses of the data at the fronto-central ROI 
revealed only a main effect of boundary for the N1 and P2
AEPs (p < 0.05). The N1 and P2 AEP amplitudes were larger
following the onsets than offsets of interruptions. Because there 
were no effects of the main variables of interest (e.g., congru- 
ency), we turned our attention to the centro-parietal region, 
where AEP amplitude differences reflected illusory perception 
and congruency effects. 

N1 AEP amplitude

Figure 4C shows the bar graphs summarizing the ANOVA con-
trasting the N1 amplitude across conditions at the centro-parietal
ROI. An ANOVA on the N1 AEP, with variables percept type,
time-locking (onsets and offsets), and congruency, revealed a
main effect of percept type approaching significance [F(1,11) =
4.1, p < 0.07, ηp² = 0.27], which was due to smaller N1 AEPs
occurring for the illusion than the illusion-failure percept across
congruency and time-locking conditions. This result is in line with
the findings of Shahin et al. (2012). There was also an interaction
between the variables percept type and congruency [F(1,11) =
8.25, p < 0.02, ηp² = 0.42]. Post hoc tests revealed that this effect
was due to smaller N1 amplitudes occurring for the illusion than
illusion-failure percepts only in the congruent, not incongruent,
condition (Newman-Keuls, p < 0.05, Figure 4C). However, there
also was an interaction among all three variables that further
differentiated the N1 effect [F(1,11) = 5.0, p < 0.05, ηp² = 0.31].
Post hoc tests revealed that the N1 suppression distinguishing
illusion from illusion-failure percepts in the congruent condition
was greater at the offsets than onsets of interruptions (Newman-
Keuls, p < 0.02, Figure 4C). In the incongruent condition, not
only was the difference not reliable, but it was in the opposite
direction.

P2 AEP amplitude 

An ANOVA on the P2 amplitude data revealed only a main effect
of percept type [F(1,11) = 8.25, p < 0.02, ηp² = 0.42], which was
attributed to smaller P2 amplitudes occurring for the illusion
than illusion-failure percepts (mean and standard error of illusion-
failure 0.59 ± 0.15; mean and standard error of illusion 0.44 ±
0.16). This result is also consistent with the premise that weaken-
ing of AEPs accompanies illusory perception (Shahin et al.,
2012).

Summary of AEP amplitude results. In sum, the N1 ampli-
tude results support the premise that the neurophysiological
basis for illusory perception (suppression of the AC response
to interruption boundaries during continuity perception) was
only observed when the speech was accompanied by meaning-
ful speech-reading (congruent visual streams). This effect was
localized to the centro-parietal portion of the scalp. In con-
trast, the P2 AEP was not influenced by visual context, although,
like the N1 AEP, its inhibition for the illusion vs. illusion-failure
percepts indexed continuity perception.

AEP latency 

Analyses of N1 and P2 latencies yielded no differences as a
function of congruency, but yielded differences between percept 
types and time-locking conditions. Because the fronto-central 
and centro-parietal ROIs yielded qualitatively similar results, we 
report only the latency effects of the centro-parietal ROI. 

N1 AEP latency 

An ANOVA on the N1 latency data revealed only a main effect
of time-locking [F(1,11) = 6.7, p < 0.05, ηp² = 0.37], which was
due to shorter latencies occurring at the offsets than onsets of 
interruptions. 

P2 AEP latency 

An ANOVA on the P2 latency data revealed a main effect of 
percept type [F(1,11) = 181, p < 0.0001, ηp² = 0.99] and a main
effect of time-locking [F(1,11) = 154, p < 0.0001, ηp² = 0.93].
There was also an interaction between percept type and time-
locking [F(1,11) = 18, p < 0.005, ηp² = 0.62]. Shorter latencies
occurred for the illusion than illusion-failure percepts, but this 
difference was greater for the onsets than offsets of interruptions 
(p < 0.005). 

DISCUSSION 

Our study demonstrates that visual information provided by 
speech-reading reinforces inhibition of the AC response to noisy 
interruptions, thus increasing perceptual tolerance for degraded 
speech. The N1 results (suppression at interruption boundaries
during illusory perception compared to when the illusion failed)
replicate those of Shahin et al. (2012). However, this study reports
a new finding: N1 inhibition is present during speech-reading
of congruent but not incongruent audiovisual speech streams.
This neurophysiological effect was also reflected behaviorally,
whereby individuals classified the interrupted words as contin-
uous (experienced the illusion) more often during the congruent
than incongruent AV streams. A logical next question to ask is:
What are the neural dynamics that facilitate visual enhancement
of illusory filling-in? Because we used EEG in this study, it is not
feasible to identify all brain regions involved during the AV task.
However, by integrating the current findings with a synthesis of 
the neural dynamics described in previous research, a plausible 
account can be offered. 

It has been well established that the N1 AEP represents neu-
ral activity generated in the primary auditory cortex (PAC) and
surrounding areas, such as the belt region of non-PAC (Scherg
and Von Cramon, 1985; Pantev et al., 1995; Picton et al., 1999).
Thus, the N1 suppressive effect can be explained as a decrease of
neural recruitment and/or temporal alignment of neuronal fir-
ings in PAC to stimulus boundaries. Therefore, we can conclude
that this inhibitory process weakens this region's sensitivity to
interruption onsets and offsets that do not conform to the fidelity
(smoothness) of the speech envelope. This in turn heightens the
perception of speech continuity through the noise. This position
is consistent with earlier conclusions from EEG and fMRI stud- 
ies on illusory continuity (Heinrich et al., 2008; Riecke et al., 
2009; Shahin et al., 2009), which reported that greater tolerance 
for degraded speech (enhanced continuity perception) co-occurs 
with decreased activity at PAC. 

Missing from the preceding account is an explanation of the 
relevance of visual information to the observed suppressive N1
effect during perception of degraded speech. Visually-mediated
suppression of the N1 AEP is not without precedent. It has
been well reported in prior studies, using non-noisy speech, 
that visual influence on AEPs is suppressive (Besle et al., 2004; 
van Wassenhove et al., 2005; Stekelenburg and Vroomen, 2007). 
However, at the same time there is strong evidence suggesting 
that vision recruits higher-level networks along the auditory sys- 
tem, i.e., non-PAC, during visual integration. The classical view 
posits that AV integration occurs via the posterior superior tem- 
poral sulcus-gyrus (pSTS-G) and associated inter-sensory regions 
such as the middle temporal gyrus (MTG) and intra-parietal 
sulcus (IPS) (Calvert and Campbell, 2003; Beauchamp et al., 
2004a,b; Miller and D'Esposito, 2005). These previous accounts, 
taken in combination with ours, lead us to posit that visually-
mediated suppression of the N1 AEP may be part of an ascending
shift of activity along the auditory system, primed by vision.
More specifically, visual context predicts the unfolding cues in
the speech signal (e.g., those revealing phonetic identity such as
rhythm or formant transition) and causes auditory processing to
be reweighted (re-routed) from low-level auditory networks (e.g.,
PAC, N1 inhibition) to high-level ones (non-PAC, excitation).

The N1 suppressive effect may be related to the audiovisual
system's ability to reduce phase resetting of ongoing AC oscilla- 
tory activity in the alpha or theta bands (Hanslmayr et al., 2007; 
Fuentemilla et al., 2009) along acoustic boundaries (Luo and 
Poeppel, 2007). This reduction in phase-resetting may enhance 
tracking of the speech envelope (Luo and Poeppel, 2007) through 
the interruptions, and hence augment illusory continuity. This 
premise supports the findings of Shahin et al. (2012), who found 
that a reduction in N1 amplitude at the onsets and offsets of
interruptions was accompanied by reduced phase-resetting of 
theta band. 

This reweighting hypothesis fits within the auditory system's 
objective to efficiently organize incoming speech representations 
along the auditory system, such that contextual information pre- 
vails over transient spectrotemporal cues to ensure object recog- 
nition. Indeed, both animal and human neuroimaging studies 
have concluded that simple sounds are favorably processed at 
PAC; however, as sounds become more meaningful (e.g., more
structured and familiar), processing shifts to non-PAC (e.g., 
superior temporal sulcus/gyrus, middle temporal gyrus), and 
even higher-level areas (e.g., fronto-parietal and motor regions) 
(Rauschecker et al., 1995; Hickok and Poeppel, 2000, 2007; Tian
et al., 2001; Wessinger et al., 2001; Patterson et al., 2002; Pasley
et al., 2012). In light of these findings, we posit that vision must
tap into these processes and reinforce the reweighting along the 
auditory system, allowing for complex auditory representations 
to be fused with visual representations. This is key to enhancing 
intelligibility across adverse acoustical situations. 

To put the above reasoning into the context of the current 
study, the visually primed inhibition at PAC commences imme- 
diately following the onset of mouth movements (prior to the 
onset of speech or interruption). By the time the noisy interruption
unfolds, the brain has already become less sensitive to
simple features in sound, dampening the perceptual system's sensitivity
to the onsets and offsets of interruptions. Supporting this
premise is a study which reported that N1 AEP
suppression occurred only when visual anticipatory motion preceded
the sound (Stekelenburg and Vroomen, 2007). In other
words, the N1 effect was achieved only when visual cues were
contextually relevant to the auditory cues, consistent with our
results of greater N1 suppression during the congruent vs. incongruent
conditions. However, at the same time that the inhibitory
process commences at PAC, vision excites higher level auditory 
networks, so contextual knowledge (phonological/lexical) can be 
engaged to aid filling-in of missing phonemic representations. In 
other words, the low-level inhibitory and high-level excitatory 
mechanisms (the reweighting) work in tandem to fulfill illusory 
perception. This is in line with earlier studies on illusory filling-in 
which reported that this process is driven by higher-level neural 
networks in the superior temporal sulcus, angular gyrus, mid- 
dle temporal gyrus and inferior frontal gyrus (Heinrich et al., 
2008, 2011; Shahin et al., 2009), while simultaneously activity
at PAC is weakened. It may be that visual context primes those 
regions (PAC as well as the higher-level ones), reinforcing illusory 
filling-in. 

We further posit that this visual influence is reinforced during 
adverse acoustic situations, in which phonetic and lexical infor- 
mation are not as clear as in quiet situations. This assessment is 
based on the finding that N1 suppression was most pronounced
at the offsets, as opposed to the onsets, of interruptions. Our 
reasoning is that the onset of noise triggered greater reliance on 
visual modulation leading to increased tolerance for the unfold- 
ing noise, evidenced by greater suppression at AC at noise offset. 
This process may arise because of the growing necessity to encode
the unfolding patterns of the speech envelope through visual modulation
in noisy situations. A recent study of visual influence
on auditory speech stream segregation (i.e., a cocktail party setting; Zion
Golumbic et al., 2013) supports this conclusion. The authors
reported that speech envelope tracking of the visually-attended
stream in the AC was stronger when mouth movements were
absent (auditory-only stream).

A caveat of the current experimental design relates to the 
N1 inhibitory results at the offset of interruptions distinguishing
congruent and incongruent AV conditions. By temporally-locking
AEPs to the offsets of interruptions we risked an overlap from
preceding onset AEPs. Because the onsets and offsets of interruptions
were separated on average by 281 ms, the onset P2 may
have overlapped with the offset N1. However, if we take the
above group average values as representative, the peaks of the
onset P2 (~190 ms temporally-locked to onset of interruptions)
and offset N1 (~100 ms + 281 ms temporally-locked to onset
of interruptions) would still be separated by about 190 ms. Thus,
an overlap would be most likely between the leading tail of the
P2-onset and the lagging tail of the N1-offset. While this overlap
effect may be small given that the window of analysis was ±10 ms
around the peak, we nonetheless caution that the N1-offset results
could have been modulated by the onset P2.
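
The latency arithmetic behind this caveat can be checked with a short sketch. It assumes only the group-average values quoted above (onset P2 peak near 190 ms, offset N1 peak near 100 ms, mean interruption duration of 281 ms). The function and variable names are hypothetical and are not part of the study's analysis pipeline.

```python
# Group-average values quoted in the text, in milliseconds.
# All names here are illustrative assumptions, not the authors' code.
ONSET_P2_PEAK_MS = 190      # P2 peak latency, locked to interruption onset
OFFSET_N1_PEAK_MS = 100     # N1 peak latency, locked to interruption offset
MEAN_INTERRUPTION_MS = 281  # mean interval between interruption onset and offset


def offset_n1_peak_re_onset(n1_latency_ms, interruption_ms):
    """Express the offset-locked N1 peak relative to interruption onset."""
    return n1_latency_ms + interruption_ms


def peak_separation(p2_latency_ms, n1_latency_ms, interruption_ms):
    """Time between the onset-locked P2 peak and the offset-locked N1 peak."""
    return offset_n1_peak_re_onset(n1_latency_ms, interruption_ms) - p2_latency_ms


separation_ms = peak_separation(
    ONSET_P2_PEAK_MS, OFFSET_N1_PEAK_MS, MEAN_INTERRUPTION_MS
)
print(separation_ms)  # 191, i.e. "about 190 ms"
```

Because the interruption duration varied across words, the separation on any single trial would differ from this average figure.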

Our behavioral data are consistent with the view that 
visual context mitigates the disruptive effects of interruptions. 
Individuals perceived continuity for a larger number of words 
in the congruent than incongruent condition, suggesting that 
the visual context, when congruent, served to facilitate illusory
perception by reinforcing information common to both the ears
and eyes. However, the fact that illusory filling-in failed on congruent
trials some of the time calls for an explanation. It is likely
that some words contain phonemes or phonetic cues that are 
not as receptive to visual modulation as others. For example, 
the voiced/voiceless distinction is not conveyed visually, whereas 
place of articulation (labial vs. alveolar) is highly visible. These 
visually unreceptive cues are not limited to the fricatives/affricates 
originally replaced because the interruption covered on average 
190% of the duration of the phoneme, and thus extended to 
cover, in part or whole, adjacent phonemes. Moreover, because 
mouth movements naturally lead the unfolding speech, the types 
of phonemes that preceded the interrupted phoneme likely played 
a role in mediating visual influence. Neurophysiologically, this 
may explain why the visually-induced inhibitory effect on the N1
AEP was only observed during a successful illusion, not when
the illusion failed (i.e., the visual context of the illusion-failure
percepts was unhelpful for both the congruent and incongruent
conditions).

One outstanding question relates to the topographical differences
in N1 AEP (Figure 4A). In the congruent condition,
the illusion-failure N1 is maximal at the center of
the scalp, whereas the illusion N1 is more frontally located.
The difference of these two N1 AEPs resulted in a central
topography (maximum) at Cz. The auditory N1 is known
to span centro-frontal regions and to have several generators in PAC
and surrounding areas. By subtracting the N1 AEPs of the
illusion and illusion-failure percepts, we may have identified 
a region of the auditory cortex that corresponds to acoustic 
onsets and offsets. In a similar experimental design (Shahin 
et al., 2012), this region was localized to the middle por- 
tion of Heschl's Gyrus (PAC) using fMRI. Being localized to 
PAC as opposed to non-PAC is consistent with the earlier 
latency observed for the subtracted N1 (80 ms), hence reflecting
earlier processes along the auditory pathway (Figure 4A). We
should note that the mismatch of the difference topographies of 
the congruent and incongruent conditions suggests that differ- 
ent auditory generators underlie the congruent and incongruent 
effects. 

In conclusion, our findings support the hypothesis that visual 
context via speech-reading weakens representations of interfer- 
ing (non-conforming) signals (noisy interruptions). This could 
be due to a shift in processing toward high-level auditory networks
to take advantage of more complex acoustic features in speech. 
Our N1 result, along with prior research, begins to elucidate the
neural mechanisms of AV integration of degraded speech and
suggests avenues for further investigation. Namely, the hypothesis
can benefit from further investigations that manipulate the
phonemic clarity of the visual information (e.g., sensitivity of
replaced phoneme or preceding phonemes to visual influence)
while simultaneously probing the behavior of low- and high-level
auditory networks.